Research questions / Problem statement:

Infectious diseases are an important public health issue. We therefore examine overall communicable disease rates, and their trends over time, for infectious diseases reported in California. Sexually transmitted diseases are analyzed separately from the other groups of infectious diseases.

Datasets to be used:

1. Infectious Diseases by County, Year and Sex (in California), 2001-2018
   Source: https://data.chhs.ca.gov/dataset/infectious-disease
   Raw format of dataset: https://data.chhs.ca.gov/dataset/03e61434-7db8-4a53-a3e2-1d4d36d6848d/resource/75019f89-b349-4d5e-825d-8b5960fc028c/download/idb_odp_2001-2018.csv
   Name/Source: CHHS Open Data
   Number of columns: 9
   Number of rows: 154,344
   Timing: The years included in this dataset are 2001 to 2018.

2. STDs in California by Disease, County, Year and Sex
   Dataset: case counts and rates for sexually transmitted diseases (chlamydia, gonorrhea, and all forms of syphilis) reported for California residents.
   Source: https://data.chhs.ca.gov/dataset/stds-in-california-by-disease-county-year-and-sex
   Name/Source: CHHS Open Data
   Number of columns: 10
   Number of rows: 9,558
   Timing: The years included in this dataset are 2001 to 2018.

Data Cleaning:

We inspected the data frame in the environment pane and printed its first five rows with the head function. We noticed that the “rate” column contains many dashes where the rate is missing because some diseases had 0 cases. Listing every value in the column with unique(id_data$rate) showed that, besides the dash, it also contains empty cells and the code “SC”. We cleaned the data by treating the empty value " ", "-", "SC", and "NA" as missing (NA). We specified these values when reading in the data frame:
na = c(" ", "-", "SC", "NA")
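As a minimal sketch of this cleaning step, the example below reads a small inline sample with base R's read.csv and its na.strings argument. This is an assumption made to keep the example self-contained: the `na =` argument above suggests a different reader was actually used, and the sample rows are made up.

```r
# Hypothetical sample of the raw file: "-", "SC", and empty cells mark a missing rate.
raw_text <- "Disease,County,Year,Sex,Cases,rate
Malaria,ALAMEDA,2001,Total,5,0.3
Malaria,ALPINE,2001,Total,0,-
Malaria,MODOC,2001,Total,1,SC
Malaria,SIERRA,2001,Total,0,
"

# Placeholders to convert to NA while reading (the values listed in the text, plus "").
na_values <- c(" ", "-", "SC", "NA", "")
id_data <- read.csv(text = raw_text, na.strings = na_values,
                    stringsAsFactors = FALSE)

# With the placeholders gone, the rate column is numeric and holds only numbers and NA.
unique(id_data$rate)
is.numeric(id_data$rate)
```

Declaring the placeholders at read time keeps the rate column numeric from the start, so no later conversion from text is needed.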

Creating variables for data analysis:

We created new grouping variables to facilitate data presentation and analysis. The new variables are:

  1. California region: the name of the region, for the 10 different California regions.

  2. Type of infectious disease: groups each of the reported diseases by “type of disease”, following conventional microbiology classification.

  3. Time period: years grouped in sets of 3.

California regions:

Superior <- "NEVADA", "PLACER", "PLUMAS", "SACRAMENTO", "SHASTA", "SIERRA", "SISKIYOU", "SUTTER", "TEHAMA", "YOLO", "YUBA", "MODOC", "EL DORADO", "BUTTE", "GLENN", "LASSEN"
North Coast <- "DEL NORTE", "HUMBOLDT", "LAKE", "MENDOCINO", "NAPA", "SONOMA", "TRINITY"
Bay Area <- "ALAMEDA", "CONTRA COSTA", "MARIN", "SAN FRANCISCO", "SAN MATEO", "SANTA CLARA", "SOLANO"
North San Joaquin Valley <- "ALPINE", "AMADOR", "CALAVERAS", "MADERA", "MARIPOSA", "MERCED", "MONO", "SAN JOAQUIN", "STANISLAUS", "TUOLUMNE"
Central Coast <- "MONTEREY", "SAN BENITO", "SAN LUIS OBISPO", "SANTA BARBARA", "SANTA CRUZ", "VENTURA"
South San Joaquin Valley <- "FRESNO", "INYO", "KERN", "KINGS", "TULARE"
Inland Empire <- "RIVERSIDE", "SAN BERNARDINO"
LA County <- "LOS ANGELES"
Orange County <- "ORANGE"
San Diego and Imperial County <- "IMPERIAL", "SAN DIEGO"

We will also have “California” as a total for the State.
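
One way to attach these regions to the data is a county-to-region lookup vector. The sketch below uses base R and only three of the ten regions. The data frame `df` and its County column are assumptions for illustration, not the original code.

```r
# Region -> counties, for three of the ten regions (the others follow the same pattern).
regions <- list(
  "Bay Area"      = c("ALAMEDA", "CONTRA COSTA", "MARIN", "SAN FRANCISCO",
                      "SAN MATEO", "SANTA CLARA", "SOLANO"),
  "Inland Empire" = c("RIVERSIDE", "SAN BERNARDINO"),
  "LA County"     = c("LOS ANGELES")
)

# Invert the list into a named vector: names are counties, values are regions.
county_to_region <- setNames(rep(names(regions), lengths(regions)),
                             unlist(regions, use.names = FALSE))

# Hypothetical data frame with a County column, including the statewide total row.
df <- data.frame(County = c("MARIN", "LOS ANGELES", "RIVERSIDE", "CALIFORNIA"),
                 stringsAsFactors = FALSE)
df$Region <- unname(county_to_region[df$County])

# "CALIFORNIA" is not a county, so it is labelled explicitly as the State total.
df$Region[df$County == "CALIFORNIA"] <- "California"
df
```

Any county missing from the lookup comes back as NA, which makes gaps in the region definitions easy to find.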

Groups of infectious diseases:

  1. Parasitic <- c("Amebiasis", "Babesiosis", "Cryptosporidiosis", "Cyclosporiasis", "Cysticercosis or Taeniasis", "Malaria", "Giardiasis", "Trichinosis")

  2. Toxin_related <- c("Botulism, Foodborne", "Botulism, Other", "Botulism, Wound", "Ciguatera Fish Poisoning", "Domoic Acid Poisoning", "Paralytic Shellfish Poisoning", "Scombroid Fish Poisoning")

  3. viral <- c("Chikungunya Virus Infection", "Dengue Virus Infection", "Flavivirus Infection of Undetermined Species", "Hantavirus Infection", "Hepatitis E acute infection", "Rabies, human", "Yellow Fever", "Zika Virus Infection")

  4. prions <- c("Creutzfeldt-Jakob Disease and other Transmissible Spongiform Encephalopathies")

  5. fungal <- c("Coccidioidomycosis")

  6. Bacterial <- c("Anaplasmosis", "Anaplasmosis and Ehrlichiosis", "Anthrax", "Brucellosis", "Campylobacteriosis", "Cholera", "E. coli O157", "E. coli Other STEC (non-O157)", "Legionellosis", "Leprosy (Hansen’s Disease)", "Leptospirosis", "Listeriosis", "Lyme Disease", "Plague, human", "Q Fever", "Spotted Fever Rickettsiosis", "Streptococcal Infection (cases in food and dairy workers)", "Ehrlichiosis", "Psittacosis", "Salmonellosis", "Shigellosis", "Tularemia", "Typhoid Fever", "Paratyphoid Fever", "Typhus Fever", "Relapsing Fever", "Shiga toxin-producing E. coli (STEC) without Hemolytic Uremic Syndrome (HUS)", "Vibrio Infection (non-Cholera)", "Shiga Toxin Positive Feces (without culture confirmation)", "Yersiniosis")

  7. Infectious_complications <- c("Hemolytic Uremic Syndrome (HUS) without evidence of Shiga toxin-producing E. coli (STEC)", "Hemolytic Uremic Syndrome(HUS)", "Shiga toxin-producing E. coli (STEC) with Hemolytic Uremic Syndrome (HUS)")
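
Vectors like these can be turned into a disease-type column with a small helper. The sketch below is one possible approach, not the original code: the function `classify_disease` is hypothetical, only a few disease names per group are included, and "Measles" stands in for any name that appears in no group.

```r
# Shortened versions of the disease groups (a few names each, for illustration only).
disease_groups <- list(
  Parasitic = c("Amebiasis", "Giardiasis", "Malaria"),
  Fungal    = c("Coccidioidomycosis"),
  Viral     = c("Dengue Virus Infection", "Zika Virus Infection")
)

# Hypothetical helper: returns the group name for each disease, or NA if none matches.
classify_disease <- function(disease, groups = disease_groups) {
  out <- rep(NA_character_, length(disease))
  for (type in names(groups)) {
    out[disease %in% groups[[type]]] <- type
  }
  out
}

diseases <- c("Malaria", "Coccidioidomycosis", "Zika Virus Infection", "Measles")
ID_type <- classify_disease(diseases)

# Diseases left as NA were matched by no group, which exposes spelling mismatches.
diseases[is.na(ID_type)]
```

Matching is exact, so a name such as "Hemolytic Uremic Syndrome(HUS)" must be spelled exactly as it appears in the dataset, including the missing space.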

Grouping by Three-Year Increments:

“2001-2003”, “2004-2006”, “2007-2009”, “2010-2012”, “2013-2015”, “2016-2018”
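
These three-year periods can be derived from the year with base R's cut function. The sketch below is one possible implementation. The variable name Time_Period matches the table headers in the Results, and the rest is assumed.

```r
years <- 2001:2018

period_labels <- c("2001-2003", "2004-2006", "2007-2009",
                   "2010-2012", "2013-2015", "2016-2018")

# Breaks at 2000, 2003, ..., 2018; intervals are right-closed by default,
# so (2000, 2003] holds 2001-2003, (2003, 2006] holds 2004-2006, and so on.
Time_Period <- cut(years, breaks = seq(2000, 2018, by = 3), labels = period_labels)

# Each period should contain exactly three years.
table(Time_Period)
```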

Analytic Methods:

We are reporting descriptive data using frequency analysis.

Results (Tables and Figures)

Infectious disease rates over time in California from 2001-2018 by etiology of disease and time period (3-year cumulative)
Time_Period  Bacterial  Fungal     Parasitic  Viral
2001-2003    36.60314    4.896628  10.021440  0.0402668
2004-2006    34.10182    7.786118   8.816448  0.0277256
2007-2009    34.39147    7.002130   8.043632  0.0532329
2010-2012    36.38063   12.194507   7.007474  0.1979917
2013-2015    41.56742    7.590643   7.080101  0.3491155
2016-2018    47.77543   17.444321   9.291980  0.9902012
Infectious disease rates over time in the Bay Area from 2001-2018 by etiology of disease and time period (3-year cumulative)
Time_Period  Bacterial  Fungal     Parasitic  Viral
2001-2003     9.469341  0.0906707  3.497415   0.0076792
2004-2006     8.555531  0.1987058  2.703240   0.0055438
2007-2009     8.320628  0.2248960  2.372584   0.0162190
2010-2012     8.911663  0.2793794  2.231737   0.0608398
2013-2015    10.012628  0.2983489  2.202779   0.1109373
2016-2018    11.155957  0.6190794  3.088959   0.3594545

Figures and code:

Figure 1 shows that, among the reported infectious diseases (excluding sexually transmitted diseases), bacterial diseases are the most commonly reported, followed by fungal and then parasitic diseases. Viral diseases have the lowest rate. These numbers do not necessarily translate into real prevalence, since many diseases are not considered “reportable” because they are so common and widely distributed. In general, the frequency of reported bacterial, fungal, and viral diseases has increased over the years, while parasitic diseases decreased until 2016-2018, when they show an increasing trend.

Figure 2: This figure shows that reports of bacterial diseases have increased over time since 2001. This increase could reflect a real increase in reportable cases or improved reporting methodology. The same applies to fungal infections. Parasitic infections have decreased, except for the period 2016-2018, which shows an increase. Viral infection reports have increased since 2016 due to newly reportable viral conditions such as Zika and Chikungunya.

Table 3: Rate per 100,000 of reportable infectious diseases per year in California during 2001-2018, by disease type
Type of infectious disease reported (rate per 100,000)
year  bacterial  fungal  parasitic  viral
2001  37.21       4.32   11.73      0.08
2002  38.47       4.60    9.69      0.01
2003  34.13       5.77    8.64      0.03
2004  33.75       7.10    8.67      0.01
2005  34.35       7.89    8.88      0.03
2006  34.21       8.38    8.90      0.04
2007  32.91       8.05    8.75      0.03
2008  35.54       6.48    7.77      0.02
2009  34.73       6.47    7.61      0.11
2010  36.99      11.88    7.27      0.22
2011  33.39      13.87    6.91      0.11
2012  38.76      10.83    6.84      0.26
2013  38.32       8.65    6.95      0.34
2014  41.53       5.99    6.79      0.34
2015  44.85       8.13    7.50      0.37
2016  43.02      14.13    8.73      1.69
2017  48.56      19.33    9.45      0.77
2018  51.75      18.87    9.70      0.51
Data Sources
Data from https://data.chhs.ca.gov/dataset/infectious-disease

Table 3: This table shows the reported cases per 100,000 by infectious disease type during 2001-2018 (the same data as Figure 2).

Figure 3: Among the bacterial infections, the most commonly reported is Campylobacteriosis, followed by Salmonellosis and Shigellosis.

Figure 4: The most common parasitic disease is Giardiasis, followed by Amebiasis and Cryptosporidiosis.

Figure 5: Among viral infections, the most commonly reported was Dengue virus infection. The newly described Chikungunya and Zika viruses were not reported in California until 2017.

Analyzing STDs:

Methods for Figure 6:

The visualization I wanted to create is a graph of the rates of each disease type (bacterial, viral, STD, etc.) over time within the Bay Area. The steps to do this were:

  1. I eliminated column 10 in the STD dataset so that the two datasets had the same column names.
  2. I filtered both datasets to Bay Area counties only (“ALAMEDA”, “SANTA CLARA”, “SAN MATEO”, “SAN FRANCISCO”, “MARIN”, “CONTRA COSTA”, “SOLANO”).
  3. I merged the two datasets into one using rbind and converted all the county names and sex values to lowercase (using tolower()).
  4. I filtered sex to “total”.
  5. I created a new variable, Disease_Type, which categorizes each entry as Bacteria, Virus, STD, Fungal, Protozoa, Toxin, Prion, or Infectious Complication.
  6. I deleted the rate column (I created my own rate).
  7. I created a new rate per year per disease type. First, I grouped by Disease_Type and Year and summed Cases (Sum_cases) to get the total cases per year per disease type. Then I created a second dataset with the population per year in the Bay Area using this code: group_by(County, Year) %>% summarize(total_pop = mean(Population)) %>% group_by(Year) %>% summarise(totalp = sum(total_pop)). Finally, I merged the two datasets using left_join and created a new overall rate with this code: Dataset$Overall_Rate <- (Dataset$Sum_cases / Dataset$totalp) * 100000
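
Step 7 can be sketched end to end on a toy dataset. Everything in the example is illustrative: the data frame `merged` and its numbers are made up, and the column names follow the ones used in the steps above. Because the population repeats on every disease row of a county and year, mean(Population) collapses those repeats before the counties are summed.

```r
library(dplyr)

# Hypothetical merged data: two disease types, two counties, two years.
merged <- data.frame(
  Disease_Type = rep(c("Bacteria", "STD"), each = 4),
  County       = rep(c("alameda", "marin"), times = 4),
  Year         = rep(rep(c(2001, 2002), each = 2), times = 2),
  Cases        = c(100, 25, 120, 30, 400, 100, 500, 125),
  Population   = rep(c(1000000, 250000), times = 4)
)

# Total cases per disease type and year.
cases_by_type <- merged %>%
  group_by(Disease_Type, Year) %>%
  summarise(Sum_cases = sum(Cases), .groups = "drop")

# Population per year: one value per county and year first, then summed over counties.
pop_by_year <- merged %>%
  group_by(County, Year) %>%
  summarise(total_pop = mean(Population), .groups = "drop") %>%
  group_by(Year) %>%
  summarise(totalp = sum(total_pop), .groups = "drop")

# Join cases to population and compute the rate per 100,000.
rates <- left_join(cases_by_type, pop_by_year, by = "Year")
rates$Overall_Rate <- (rates$Sum_cases / rates$totalp) * 100000
rates
```

Setting `.groups = "drop"` returns ungrouped results and suppresses the regrouping message that summarise() otherwise prints.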

Interpretation of graph: This graph shows the overall rates per year of different types of infectious disease from 2001 to 2018 in the Bay Area counties, with the Y axis adjusted. I created this graph to better visualize the trends. The graph shows that fungal and bacterial rates are increasing over time, along with STD rates.

Methods for Figure 7:

The goal was to visualize trends in STD rates within the Bay Area over the years included in the dataset (2001-2018), separated by sex. The steps I used to accomplish this were:

  1. I took the STD dataset, filtered it to Bay Area counties, and filtered the Sex column to rows that contained only “male” or “female”.
  2. I removed the rate column.
  3. I created my own Overall Rate column, the rate of all STDs per year by sex. I used this code to get the total number of cases per year by sex: group_by(Sex, Year) %>% summarise(case_total = sum(Cases)). Then I used the dataset from Figure 6 that had the total population per year. Finally, I merged the two datasets using left_join and created a rate by taking (cases/pop)*100,000.
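
A compact sketch of these steps is shown below. The data frames `std_bay` and `pop_by_year` and all their numbers are hypothetical. Note that, as described above, the denominator is the total population per year rather than a sex-specific population.

```r
library(dplyr)

# Hypothetical STD rows for two counties, already filtered to "male" and "female".
std_bay <- data.frame(
  County = rep(c("alameda", "marin"), times = 4),
  Sex    = rep(c("female", "male"), each = 4),
  Year   = rep(rep(c(2013, 2014), each = 2), times = 2),
  Cases  = c(200, 100, 220, 100, 150, 100, 300, 100)
)

# Hypothetical total population per year (the role played by the Figure 6 dataset).
pop_by_year <- data.frame(Year = c(2013, 2014), totalp = c(100000, 100000))

# Total cases per sex and year, joined to population, then the rate per 100,000.
rate_by_sex <- std_bay %>%
  group_by(Sex, Year) %>%
  summarise(case_total = sum(Cases), .groups = "drop") %>%
  left_join(pop_by_year, by = "Year") %>%
  mutate(Overall_Rate = (case_total / totalp) * 100000)
rate_by_sex
```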

Interpretation of graph: This graph shows the overall rates per year of bacterial STDs in the Bay Area counties from 2001 to 2018, separated by sex. I created this graph to better visualize the trends in STDs among males and females. The graph shows a large increase in the overall rate of STDs for both males and females. Prior to around 2014, female rates appear to have been higher than male rates. From around 2014 onward, however, male rates increase even more steeply.

Methods for Table 4:

The goal was to present the overall rate of each STD (Chlamydia, Gonorrhea, Syphilis) within the Bay Area over the years 2014-2018 in a table. These are the steps I took to accomplish this:

  1. I took the STD dataset, filtered it to Bay Area counties, and filtered the Sex column to rows that contained only “total”.
  2. I created my own Overall Rate column, the rate of each STD per year. I used this code to get the total number of cases of each STD per year: group_by(Disease, Year) %>% summarise(STD_case_total = sum(Cases)). Then I used the same total-population-per-year dataset as for Figure 7. Finally, I merged the two datasets using left_join and created a rate by taking (cases/pop)*100,000.
  3. I selected the Year, Disease, and Overall Rate columns.
  4. I pivoted the dataset (pivot_wider) so that the years became the columns and the diseases became the rows.
  5. I selected the columns Disease and the years 2014-2018.
  6. I used the kable function to create a data table.
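
Steps 3 to 6 can be sketched as follows. The data frame `std_rates` and its values are made up, and the example covers only 2013-2015, so it keeps 2014 and 2015 where the real table keeps 2014-2018.

```r
library(dplyr)
library(tidyr)

# Hypothetical long-format rates: one row per disease and year.
std_rates <- data.frame(
  Disease      = rep(c("Chlamydia", "Gonorrhea"), each = 3),
  Year         = rep(2013:2015, times = 2),
  Overall_Rate = c(400, 410, 468, 120, 132, 168)
)

# Years become columns and diseases become rows; then keep only the years of interest.
wide <- std_rates %>%
  select(Year, Disease, Overall_Rate) %>%
  pivot_wider(names_from = Year, values_from = Overall_Rate) %>%
  select(Disease, `2014`, `2015`)

# Render as a table (kable comes from the knitr package).
knitr::kable(wide, digits = 2)
```

After pivoting, the year columns have non-syntactic names, so they must be wrapped in backticks when selected by name.
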
Table 4: Rate (per 100,000) of STDs in the Bay Area from 2014-2018
Disease         2014    2015    2016    2017    2018
Chlamydia       410.81  468.22  486.89  537.19  571.75
Early Syphilis   26.20   29.67   31.24   38.29   41.31
Gonorrhea       132.26  168.62  189.40  217.44  222.31

Interpretation of table: This table shows the rates of STDs in the Bay Area over the last 5 years included in the dataset (2014-2018). Rates of all three STDs increased every year. Chlamydia rates are also much higher than rates of early syphilis and gonorrhea.

Discussion: